Large-scale agglomerative clustering is hindered by computational burdens. We propose a novel
scheme in which exact inter-instance distance calculation is replaced by the Hamming distance
between Kernelized Locality-Sensitive Hashing (KLSH) hashed values. This results in a method that
drastically decreases computation time. Additionally, we take advantage of a limited set of labeled
data points via distance metric learning to achieve precision and recall competitive with K-Means,
but in much less computation time.
1 Introduction

Our proposed research topic is to cluster large-scale datasets with a semi-supervised approach
based on hashing methods. In particular, our goal is to explore the underlying data distribution by
clustering the data points and differentiating the classes. When a small set of labeled data that
comes from only a subset of the classes is given, we want to recover the data distribution for the
complete set of classes. For example, given labels for two classes, can we separate these two
classes well and at the same time discover the existence of a third class? This requires using the
information from the labeled data to find a metric transformation that splits the two classes well;
after this data transformation, we can discover that a third class exists.
Suppose there is a handwritten digit recognition task and the dataset contains the digits '2', '7'
and '4'. If a general agglomerative clustering is run, it might end up with two clusters, with '2'
and '7' in one cluster and '4' in the other, due to the similarity of their shapes. However, when a
small labeled set of classes '2' and '7' is given, we can learn a degree of granularity for
similarity comparison. By using a data transformation that maximally splits '2' and '7' into two
clusters, we are able to identify the existence of another cluster, digit '4'. Because
agglomerative clustering is computationally inefficient, a major contribution of this paper is to
introduce a machine-learned hashing method, kernelized locality-sensitive hashing (KLSH), into
agglomerative clustering. This results in efficient clustering computation for large-scale
datasets.

Our paper is structured as follows. We review background and related work in Section 2.
Section 3 presents our algorithms for distance metric learning and KLSH clustering. Section 4
describes the experiments with a discussion of the results, followed by conclusions in Section 5.
2 Related Work

There has been much previous work on cluster seeding to address the limitation that iterative
clustering techniques (e.g. K-Means and Expectation Maximization (EM)) are sensitive to the choice
of initial starting points (seeds). The problem addressed is how to select seed points in the
absence of prior knowledge. Kaufman and Rousseeuw [1] propose an elaborate mechanism: the first
seed is the instance that is most central in the data; the rest of the representatives are selected
by choosing instances that promise to be closer to more of the remaining instances. Pena et al. [2]
empirically compare four initialization methods for the K-Means algorithm and show that the random
and Kaufman initializations outperform the other two, since they make K-Means less dependent on
the initial choice of seeds. In K-Means++ [3], the random starting points are chosen with specific
probabilities: a point p is chosen as a seed with probability proportional to p's contribution
to the overall potential (defined by the sum of squared distances between each point and its
closest center). By augmenting K-Means with this simple, randomized seeding technique, K-Means++ is
O(log K)-competitive with the optimal clustering. Bradley and Fayyad [4] propose refining the
initial seeds by taking into account the modes of the underlying distribution. The refined initial
seeds enable the iterative algorithm to converge to a better local minimum. Semi-supervised
learning can also be seen as unsupervised learning guided by constraints. Noting that clustering
depends heavily on the distance metric, while a particular algorithm merely executes the rules the
metric defines, [5] point out the need for a systematic way to learn a distance metric for
clustering from labeled data. Their approach poses metric learning as a convex optimization
problem.

When data size grows exponentially, hashing is a technique that is especially well suited to
large-scale problems. [6] describe the Locality-Sensitive Hashing (LSH) method, an efficient
algorithm for the approximate and exact nearest neighbor problem. Their goal is to preprocess a
dataset of objects (e.g. images) so that later, given a new query object, one can quickly return the
dataset object that is most similar to the query. The technique is of significant interest in a wide
variety of areas of unsupervised learning. Hierarchical clustering tries to solve a similar problem
from another perspective: by iteratively finding nearest neighbors, it groups data into clusters.
Kernelized LSH was later proposed by [7] for fast image search. It generalizes LSH to accommodate
arbitrary kernel functions, making it possible to preserve the algorithm's sub-linear time
similarity search guarantees for a wide class of useful similarity functions.



3 Methods

In this section, we describe our methods for solving a large-scale semi-supervised learning
problem, first introducing distance metric learning and then our fast agglomerative clustering
method based on kernelized locality-sensitive hashing (KLSH).

3.1 Distance Metric Learning

Suppose the data given to us contain a few labeled points, so that for some pairs of points we know
for certain whether they belong to the same class or to different classes. We then have a
similarity matrix $S$ and a dissimilarity matrix $D$. For entry $S_{i,j}$ of the similarity matrix
$S$, $S_{i,j} = 1$ if data points $x_i$ and $x_j$ are in the same class and $0$ otherwise; the
dissimilarity matrix $D$ is defined analogously for pairs known to be in different classes. Based
on [5], we try to learn a distance metric $\|x - y\|_A = \sqrt{(x - y)^T A (x - y)}$, where $x, y$
are two data points and $A$ is a positive semi-definite matrix of distance parameters. The idea is
to minimize the distance between similar points while keeping dissimilar points apart.

This problem can be solved efficiently using constrained Newton's descent on the objective function
shown after Algorithm 2.
Algorithm 1 Semi-Supervised Clustering with Hashing
Input: DataLabeled (limited), Data $x_1, \ldots, x_N$.
Step 1: Learn distance metric $A$ from labeled (limited) data.
Step 2: Build A-Distance KLSH table (Algorithm 2).
Step 3:

Initialization:

• Let cluster distribution $R_0 = \{C_i = \{x_i\},\ i = 1, \ldots, N\}$, i.e. each data point is an
individual cluster $C_i$.

• Let proximity matrix $P_0 = P(X)$ of hash keys.

• Set $t = 0$.
repeat

• $t = t + 1$

• Find $C_i, C_j$ such that $d(C_i, C_j) = \min_{r,s = 1, \ldots, N,\ r \neq s} d(C_r, C_s)$.

• Merge $C_i, C_j$ into a single cluster $C_q$ and form $R_t = (R_{t-1} - \{C_i, C_j\}) \cup \{C_q\}$.

• Define the proximity matrix $P_t$ from $P_{t-1}$ by (a) deleting the two rows and columns that
correspond to the merged clusters and (b) adding a row and a column for the new cluster.

until the remaining number of clusters is equal to a specified $k$, or the inconsistency coefficient
exceeds a threshold.

Step 4: Retrieve actual data instances from the KLSH hash table for the corresponding clusters.
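
Steps 3 and 4 of Algorithm 1 can be sketched with SciPy's hierarchical clustering routines. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the helper name `cluster_hash_codes`, the average-linkage rule, and the synthetic 16-bit codes are all hypothetical choices, and the hash codes are assumed to have been computed already.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_hash_codes(codes: np.ndarray, k: int) -> np.ndarray:
    """Cluster binary hash codes (one row per instance) into at most k groups."""
    # Instances that share a hash key fall into the same bucket, so only the
    # distinct keys need to enter the proximity matrix.
    keys, inverse = np.unique(codes, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)  # some NumPy versions return a 2-D inverse here
    if len(keys) == 1:
        return np.ones(len(codes), dtype=int)
    # Assumed linkage rule: average linkage on Hamming distance
    # (SciPy's "hamming" metric is the fraction of differing bits).
    tree = linkage(keys, method="average", metric="hamming")
    key_labels = fcluster(tree, t=k, criterion="maxclust")
    # Step 4: map every instance back to the cluster of its hash key.
    return key_labels[inverse]


rng = np.random.default_rng(0)
prototypes = np.array([[0] * 16, [1] * 16])
# 40 synthetic 16-bit codes: each prototype repeated 20 times with exactly two bits flipped.
codes = np.repeat(prototypes, 20, axis=0)
for row in codes:
    flips = rng.choice(16, size=2, replace=False)
    row[flips] ^= 1
labels = cluster_hash_codes(codes, k=2)
```

Stopping on the inconsistency coefficient instead of a fixed number of clusters would correspond to calling `fcluster` with `criterion="inconsistent"` and a threshold.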



Algorithm 2 Build A-Distance KLSH Table
Input: Data $x_1, \ldots, x_N$, distance parameter $A$.
Step 1: Randomly select $p$ points from data, denoted $x_1, \ldots, x_p$.

Build kernel $K(i,j) = \exp(-d(x_i, x_j)^2 / \sigma^2)$, for $i, j = 1, \ldots, p$, where
$d(x, y) = \sqrt{(x - y)^T A (x - y)}$.

Step 2: Apply SVD to $K$; suppose $K = U \Sigma U^T$. Then $K^{-1/2} = U \Sigma^{-1/2} U^T$.

Step 3: Form a $p$-dimensional vector $e_S$ in which $t$ dimensions are 1 while the others are 0.
These $t$ dimensions are chosen randomly.

Step 4: $w = K^{-1/2} e_S$. For any $x$, the bit is created as
$h(x) = \mathrm{sign}\left(\sum_{i=1}^{p} w_i \, k(x, x_i)\right)$.
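
The steps of Algorithm 2 can be sketched in NumPy as follows. This is an illustrative sketch only: the function name `build_klsh`, the default values of `p`, `t`, `n_bits`, and `sigma`, and the random test data are assumptions, and an eigendecomposition is used in place of the SVD, which gives the same factorization for a symmetric positive semi-definite kernel matrix. One weight vector $w$ is drawn per hash bit.

```python
import numpy as np


def build_klsh(X, A, p=30, t=10, n_bits=16, sigma=1.0, seed=0):
    """Return a function that maps rows of data to n_bits-bit KLSH codes."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select p points and build the kernel matrix over them.
    anchors = X[rng.choice(len(X), size=p, replace=False)]

    def kernel(U, V):
        diff = U[:, None, :] - V[None, :, :]
        d2 = np.einsum("ijk,kl,ijl->ij", diff, A, diff)  # squared A-distance
        return np.exp(-d2 / sigma**2)

    K = kernel(anchors, anchors)
    # Step 2: K = U S U^T, so K^{-1/2} = U S^{-1/2} U^T.
    s, U = np.linalg.eigh(K)
    s = np.clip(s, 1e-10, None)  # guard against tiny or negative eigenvalues
    K_inv_sqrt = (U / np.sqrt(s)) @ U.T
    # Steps 3-4: one random indicator vector e_S and weight vector w per bit.
    W = np.zeros((p, n_bits))
    for b in range(n_bits):
        e_s = np.zeros(p)
        e_s[rng.choice(p, size=t, replace=False)] = 1.0
        W[:, b] = K_inv_sqrt @ e_s

    def hash_fn(Q):
        # h(x) = sign(sum_i w_i k(x, x_i)), stored as 0/1 bits.
        return (kernel(Q, anchors) @ W > 0).astype(np.uint8)

    return hash_fn


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
hash_fn = build_klsh(X, A=np.eye(5), sigma=2.0)
codes = hash_fn(X)
```

The kernel helper materializes all pairwise differences at once, which is convenient for a small demonstration but would need batching for datasets of the size used in Section 4.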



$$\sum_{i,j \,:\, S_{i,j} = 1} \|x_i - x_j\|_A^2 \;-\; \log\Big( \sum_{i,j \,:\, D_{i,j} = 1} \|x_i - x_j\|_A \Big).$$

This semi-supervised step learns a distance metric for data transformation before the main
agglomerative clustering.
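
To make the metric and its objective concrete, the following sketch evaluates $\|x - y\|_A$ and the objective of this section on four toy points. The function names, the pair lists, and the two candidate matrices are hypothetical; the sketch only evaluates the objective for a given $A$ and does not perform the constrained Newton optimization.

```python
import numpy as np


def a_distance(x, y, A):
    """||x - y||_A = sqrt((x - y)^T A (x - y)) for a positive semi-definite A."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ A @ d))


def objective(X, A, similar_pairs, dissimilar_pairs):
    """Squared A-distances over similar pairs minus log of summed A-distances over dissimilar pairs."""
    pull = sum(a_distance(X[i], X[j], A) ** 2 for i, j in similar_pairs)
    push = sum(a_distance(X[i], X[j], A) for i, j in dissimilar_pairs)
    return pull - np.log(push)


# Two classes separated along the first axis, with within-class spread along the second.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
similar = [(0, 1), (2, 3)]      # pairs with S_ij = 1
dissimilar = [(0, 2), (1, 3)]   # pairs with D_ij = 1
g_identity = objective(X, np.eye(2), similar, dissimilar)
g_weighted = objective(X, np.diag([1.0, 0.1]), similar, dissimilar)
```

In this toy setting, down-weighting the second axis shrinks the distances between similar points while leaving the dissimilar pairs equally far apart, so `g_weighted` is lower than `g_identity`.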

3.2 Clustering with KLSH 

The curse of dimensionality is a well-known problem for learning on large-scale datasets. It is
related to the fact that the cost of computation grows exponentially with the number of data
dimensions or the number of data instances. This problem directly affects clustering approaches
that are based on density estimation in the input space.

For instance, in K-Means or general agglomerative clustering, the cost of iteratively estimating
new centroid locations and reassigning data instances to clusters places a significant burden on
performance. This is especially true for high-dimensional datasets, where frequently computing
inter-instance distances is highly expensive.

Our proposed semi-supervised clustering algorithm using kernelized locality-sensitive hashing
(KLSH), given in Algorithm 1, aims to solve the large-scale agglomerative clustering problem. It
first learns a distance metric $A$ from a small set of labeled data (Step 1 in Algorithm 1). The
second step builds a KLSH table that maps the data into hashed bits. In the rest of the procedure,
agglomerative clustering is performed. Instead of explicitly computing inter-instance distances,
the clustering is done on the KLSH-hashed data points by measuring their Hamming distance. This
kernelized locality-sensitive hashing method has a high probability of preserving neighborhoods,
so it is a reasonable substitute for the exact inter-instance distances.
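
The reason Hamming distance is cheap to evaluate is that two hash keys can be compared with a single XOR followed by a bit count. The sketch below illustrates this on three 8-bit codes; the helper names `pack_bits` and `hamming` and the example codes are hypothetical.

```python
import numpy as np


def pack_bits(codes: np.ndarray) -> list[int]:
    """Pack each row of 0/1 bits into one Python integer (first bit most significant)."""
    return [int("".join(str(int(b)) for b in row), 2) for row in codes]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed hash keys."""
    return bin(a ^ b).count("1")


codes = np.array([[1, 0, 1, 1, 0, 0, 1, 0],
                  [1, 0, 1, 0, 0, 0, 1, 1],
                  [0, 1, 0, 0, 1, 1, 0, 1]])
keys = pack_bits(codes)
d01 = hamming(keys[0], keys[1])  # the codes differ in two positions
d02 = hamming(keys[0], keys[2])  # the third code is the bitwise complement of the first
```

The cost of each comparison depends only on the hash length, not on the dimensionality of the original data.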
Table 1: Experiment results that compare four methods: (1) K-Means, (2) K-Means with
Distance-metric Learning (DL), (3) Agglomerative clustering using KLSH and (4) Agglomerative
clustering using KLSH with Distance-metric Learning. Precision (Pre), recall (Rec) and computation
time are reported. All methods run on data with 10 underlying classes, and the hash code is 32-bit.

          K-Means                 K-Means w/ DL           Aggl. KLSH              Aggl. KLSH w/ DL
# Inst    Pre    Rec    Time      Pre    Rec    Time      Pre    Rec    Time      Pre    Rec    Time
5000      .590   .564   13.246    .537   .496   15.647    .573   .305   2.155     .631   .272   2.466
10000     .568   .540   46.398    .580   .556   33.736    .520   .336   7.255     .613   .250   5.246
15000     .574   .530   69.252    .556   .539   186.469   .584   .180   13.843    .610   .156   8.077
20000     .589   .563   79.499    .455   .448   112.178   .609   .355   3.052     .617   .292   18.070
30000     .523   .503   164.853   .552   .541   139.773   .624   .235   58.646    .548   .306   23.136
50000     .560   .531   339.599   .565   .530   333.313   .579   .230   126.280   .590   .252   122.558

4 Experiments 

Our experiment is based on the MNIST dataset of handwritten digits. We evaluate our KLSH ag- 
glomerative clustering algorithm via a comparison to K-Means. 

4.1 Datasets 

We obtained handwritten digits from the MNIST data repository. There are 10 classes of rasterized
images (corresponding to digits from '0' to '9'). We used up to 50,000 data points for experiments.

4.2 Experiment Setup 

We ran experiments using both K-Means and KLSH agglomerative clustering with and without 
distance metric learning. Hash string length, the number of classes, and the number of data points 
are varied one at a time. We report precision, recall and the computation time. All the experiments 
were done on a machine with 8-core Intel processors of 2.8 GHz and 8 GB of RAM. 
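
For readers who wish to reproduce this kind of evaluation, one common way to score a clustering against class labels is pairwise precision and recall, where a pair of points counts as a true positive if the two points share both a class and a cluster. The sketch below uses that pair-counting definition as an assumption; it is offered as an illustration and is not a statement of the exact evaluation protocol behind the reported numbers. The function name and the toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_precision_recall(true_labels, cluster_labels):
    """Pair-counting precision and recall of a clustering against class labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        tp += same_class and same_cluster            # correctly grouped pair
        fp += (not same_class) and same_cluster      # wrongly grouped pair
        fn += same_class and (not same_cluster)      # wrongly separated pair
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Five points, two classes; the third point is placed in the wrong cluster.
precision, recall = pairwise_precision_recall([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```

This loop is quadratic in the number of points, so a contingency-table formulation would be preferable at the dataset sizes used here.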

4.3 Results and Analysis 

Tables 1-3 summarize our results; several trends are worth noting.

First, in Table 1 we observe that KLSH agglomerative clustering can achieve the same level of
precision for a fraction of the computational cost. The downside is that recall drops by roughly a
factor of 2. The decrease in recall is caused by the fact that KLSH cannot recover all of the points
in the nearest neighborhood. The addition of distance metric learning has noticeable benefits on
performance for KLSH agglomerative clustering.

In Table 2 we analyze the effect of an increase in the number of classes (while fixing the number
of data points) on precision, recall, and computation time. Precision remains roughly constant
while recall decreases. The computational cost remains relatively independent of the number of
clusters.

In Table 3 we analyze the effect of hash string length on clustering validity. Increasing the
length of the hash string increases precision and, apart from the 16-bit setting, recall.

The hash length also allows us to adjust the tradeoff between efficiency and effectiveness. Notice
that even if we use an m = 32 bit binary hash code, there are still $2^{32}$ possible outcomes. If
the hashing splits the data well, the number of entries in the table will still be very large. This
increases the accuracy of the clustering results but also leads to a higher computation cost during
agglomerative clustering.
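
The tradeoff can be made concrete with a small counting argument. Assuming the proximity matrix is built over distinct hash keys, the number of keys is at most the smaller of the dataset size and $2^m$, which bounds the number of pairwise Hamming distances. The function name below is hypothetical, and the bound is a worst case rather than a measured quantity.

```python
def max_pairwise_distances(n_points: int, n_bits: int) -> int:
    """Upper bound on proximity-matrix entries when clustering distinct hash keys."""
    n_keys = min(n_points, 2 ** n_bits)  # at most one key per point and per possible code
    return n_keys * (n_keys - 1) // 2


bound_8 = max_pairwise_distances(20_000, 8)    # 256 keys at most
bound_32 = max_pairwise_distances(20_000, 32)  # limited by the 20,000 points
```

With 8 bits the bound is 32,640 pairs, while with 32 bits it is 199,990,000 pairs for 20,000 points, which is in line with the sharp growth in computation time as the hash length increases.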

According to the results, clustering with KLSH performs best when the dataset is large and the
number of real clusters is small. Compared to K-Means, it offers a large improvement in speed. When
the true number of clusters is not large, it achieves high performance on both speed and accuracy.
In particular, at the lower levels of the linkage tree, clustering with bias (distance metric
learning) can immediately and correctly cluster similar data instances.
Table 2: Comparison of performance for various numbers of underlying classes for agglomerative
clustering using KLSH with distance metric learning. In this case, data size is 20,000 and the hash
code is 32-bit. With an increase in the number of classes, the precision remains constant while
recall decreases.

# Classes   Pre    Rec    Time
4           .714   .465   16.527
5           .629   .355   19.226
6           .702   .507   18.747
7           .611   .317   21.675
8           .654   .354   21.540
9           .603   .284   24.024
10          .617   .292   18.070

Table 3: Comparison of performance for various numbers of hash code bits for agglomerative
clustering using KLSH with distance metric learning. In this case, data size is 20,000 with 10
underlying classes. Increasing the length of hash string increases both the precision and recall.

# Bits   Pre    Rec    Time
8        .402   .245   0.043
16       .599   .111   1.000
32       .617   .292   18.070
64       .635   .380   92.452

5 Conclusions 

General hierarchical clustering methods cannot scale well to large datasets because the number of
inter-instance distance calculations grows quadratically with the number of instances. Kernelized
locality-sensitive hashing (KLSH) provides a high probability of preserving neighborhoods and is a
reasonable substitute for the exact inter-instance distances. Our proposed KLSH agglomerative
clustering alleviates the problem by computing Hamming distances between compact hash codes
instead, and achieves efficient clustering computation. The incorporation of distance metric
learning marginally improves the precision and recall.